# Reconnected InChI-to-IUPAC Dataset (v2.0.0)
 
**Version:** v2.0.0  
**Authors:** Banula Perera (University of Westminster)  
**Repository:** Harvard Dataverse  
**License / Terms of Use:** CC0 1.0

---

## 1. Overview

This dataset provides paired **InChI → IUPAC systematic name** examples for metal-containing chemistry, with a focus on **Reconnected InChI** representations, which are designed to preserve metal–ligand connectivity better than standard InChI strings.

Each record contains:
- a `standard_inchi` string,
- a `reconnected_inchi` string (post-processed / reconnected),
- a target `iupac_name`,
- categorical metadata (`category`, `has_metal`, `primary_metal`),
- and derived string length fields (`*_len`) for modeling and filtering.

The dataset is intended for training and evaluating InChI-to-IUPAC translation systems (e.g., transformer seq2seq models) and for benchmarking validity-aware post-processing pipelines.

---

## 2. Files in this Dataverse record

**Data**
- `metano_inchi_to_iupac_reconnected_v2.csv`  
  Main cleaned dataset (CSV, UTF-8). One row per compound.

**Documentation**
- `README.md` (this file)
- `data_dictionary.csv`  
  Column-level schema and definitions.

**Optional**
- `code.zip`  
  Scripts used to generate, validate, clean, and deduplicate the dataset, plus `requirements.txt`.

---

## 3. Data schema (columns)

Core fields:
- `standard_inchi` — Original/standard InChI string (must start with `InChI=`).
- `reconnected_inchi` — Reconnected InChI representation (post-processed to better preserve metal–ligand relations).
- `iupac_name` — Target IUPAC systematic name string.
- `category` — High-level chemistry category label (e.g., `COORDINATION`, `ORGANOMETALLIC`, `INORGANIC`, `ORGANIC`).
- `has_metal` — Boolean flag indicating detection of a metal element in the compound (based on a configured metal element list).
- `primary_metal` — The primary metal element symbol (e.g., `Fe`, `Cu`). Empty/NA when `has_metal=false`.

Derived fields (character counts, computed after trimming whitespace):
- `standard_inchi_len` — `len(standard_inchi)`
- `reconnected_inchi_len` — `len(reconnected_inchi)`
- `iupac_len` — `len(iupac_name)`

See `data_dictionary.csv` for detailed definitions, examples, and allowed values.
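
A quick way to sanity-check a downloaded copy against this schema is to confirm that all nine columns are present and that each `*_len` value equals the character count of its trimmed source field. The sketch below uses only the Python standard library. The helper names (`length_mismatches`, `check_csv`) are illustrative and are not taken from `code.zip`, and it assumes the length columns are stored as plain integers.

```python
import csv

# Column names as listed in this README; column order is not assumed.
EXPECTED_COLUMNS = [
    "standard_inchi", "reconnected_inchi", "iupac_name",
    "category", "has_metal", "primary_metal",
    "standard_inchi_len", "reconnected_inchi_len", "iupac_len",
]

# Maps each derived length column to the text column it is computed from.
LENGTH_SOURCES = {
    "standard_inchi_len": "standard_inchi",
    "reconnected_inchi_len": "reconnected_inchi",
    "iupac_len": "iupac_name",
}


def length_mismatches(row):
    """Return the *_len columns that disagree with len() of the trimmed source."""
    return [
        len_col
        for len_col, src_col in LENGTH_SOURCES.items()
        # Assumption: lengths are serialized as integers such as "17".
        if int(row[len_col]) != len(row[src_col].strip())
    ]


def check_csv(handle):
    """Return (row_index, bad_columns) pairs for rows with inconsistent lengths."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    problems = []
    for index, row in enumerate(reader):
        bad = length_mismatches(row)
        if bad:
            problems.append((index, bad))
    return problems
```

For the released file this could be called with `open("metano_inchi_to_iupac_reconnected_v2.csv", newline="", encoding="utf-8")` as the handle; an empty result means every length field is consistent.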

---

## 4. Data cleaning and validation

The released dataset is **deduplicated and filtered** to remove invalid or unusable records. The cleaning pipeline typically runs the following steps:

### 4.1 Basic normalization
- Trim leading/trailing whitespace from all string fields.
- Normalize internal whitespace in `iupac_name`.
- Ensure UTF-8 encoding.
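
The normalization steps above can be sketched as follows. This is a minimal illustration, not the released script: the function names are hypothetical, and because this README does not define "normalize internal whitespace", the sketch assumes it means collapsing each run of whitespace to a single space.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_field(value):
    """Trim leading and trailing whitespace from a string field."""
    return value.strip()


def normalize_iupac_name(name):
    """Trim the name, then collapse each internal whitespace run to one space.

    Assumption: collapsing runs (including tabs and newlines) is one common
    reading of "normalize internal whitespace"; the actual rule may differ.
    """
    return _WHITESPACE_RUN.sub(" ", name.strip())


def normalize_row(row):
    """Return a copy of `row` with string values trimmed and the name normalized."""
    cleaned = {
        key: normalize_field(value) if isinstance(value, str) else value
        for key, value in row.items()
    }
    if isinstance(cleaned.get("iupac_name"), str):
        cleaned["iupac_name"] = normalize_iupac_name(cleaned["iupac_name"])
    return cleaned
```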

### 4.2 Validity filters (invalid row removal)
Rows are removed if any of the following holds:

**Missing / empty fields**
- `standard_inchi` is missing/empty  
- `reconnected_inchi` is missing/empty  
- `iupac_name` is missing/empty

**InChI format checks**
- `standard_inchi` does not start with `InChI=`

**Obvious placeholder / corrupted IUPAC names**
- `iupac_name` is a placeholder like `"-"`, `"?"`, `"??"`, `"n/a"`, `"unknown"`, or matches a pattern of only punctuation (e.g., `----`, `???`).
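
The removal rules above can be expressed as a single predicate. The sketch below is an approximation under two stated assumptions: placeholder matching is case-insensitive, and "only punctuation" is read as "contains no letter or digit". The name `rejection_reason` is hypothetical.

```python
PLACEHOLDER_NAMES = {"-", "?", "??", "n/a", "unknown"}
REQUIRED_FIELDS = ("standard_inchi", "reconnected_inchi", "iupac_name")


def rejection_reason(row):
    """Return a short reason if the row should be dropped, otherwise None."""
    values = {field: (row.get(field) or "").strip() for field in REQUIRED_FIELDS}

    # Missing / empty fields.
    for field, value in values.items():
        if not value:
            return f"{field} is missing/empty"

    # InChI format check.
    if not values["standard_inchi"].startswith("InChI="):
        return "standard_inchi does not start with InChI="

    name = values["iupac_name"]
    # Assumption: placeholders are matched case-insensitively.
    if name.lower() in PLACEHOLDER_NAMES:
        return "iupac_name is a placeholder"
    # Assumption: "only punctuation" means no alphanumeric character at all.
    if not any(ch.isalnum() for ch in name):
        return "iupac_name is punctuation only"
    return None
```

Returning a reason rather than a bare boolean makes it easy to count how many rows each rule removes, for example `kept = [r for r in rows if rejection_reason(r) is None]`.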

### 4.3 CSV integrity check
Because IUPAC strings may contain commas, quotes, and brackets, the final CSV is exported with robust quoting/escaping to ensure it is parseable by standard CSV readers.
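
One way to verify this kind of integrity is a write-then-read round trip with Python's `csv` module, which quotes fields containing commas or quotes and doubles embedded quote characters by default. The sketch below is illustrative only; the function names are hypothetical, and the exact export settings used for the released file are not stated here.

```python
import csv
import io


def rows_to_csv_text(rows, fieldnames):
    """Serialize dict rows to CSV text, quoting only fields that need it."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


def survives_round_trip(rows, fieldnames):
    """True if writing and re-reading returns the same rows.

    All values must already be strings, since CSV has no other types.
    """
    return csv_text_to_rows(rows_to_csv_text(rows, fieldnames)) == rows
```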

---

## 5. Deduplication policy

Duplicates can occur when the same compound/name pair appears multiple times across upstream sources or preprocessing steps.

**Exact-duplicate removal**
- Duplicate rows are removed after normalization using an **exact-match key**.

**Recommended deduplication key (used in our scripts by default)**
- (`reconnected_inchi`, `iupac_name`, `category`)

Rationale:
- `reconnected_inchi` serves as the most structure-preserving representation for this dataset.
- `category` is included to avoid unintended merging if the same (`reconnected_inchi`, `iupac_name`) pair appears under different category labels.
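
Exact-match deduplication on this key can be sketched as below. Keeping the first occurrence of each key is an assumption, since this README does not say which duplicate is retained, and the function name is hypothetical.

```python
DEDUP_KEY = ("reconnected_inchi", "iupac_name", "category")


def deduplicate(rows, key_fields=DEDUP_KEY):
    """Return rows with exact duplicates on `key_fields` removed.

    Assumption: the first row seen for each key is kept; input order is preserved.
    Rows should already be normalized so that whitespace differences do not
    produce distinct keys.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```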

---

## 6. Intended use and limitations

**Intended use**
- Training and evaluation of sequence-to-sequence translation models from InChI to IUPAC.
- Benchmarking validity checks, repair loops, and neuro-symbolic post-processing.

**Limitations**
- IUPAC names may have legitimate stylistic variations; exact-match evaluation can underestimate chemical correctness.
- Reconnected InChI behavior depends on the reconnection procedure and tool versions.
- Category labels depend on the categorization logic and metal list used.
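
To illustrate the first limitation, the sketch below reports a strict exact-match rate next to a slightly relaxed one that ignores whitespace differences only. It is a deliberately crude example with hypothetical function names: it does not establish chemical equivalence, and other stylistic variations (hyphenation, bracket style, capitalization) are not handled.

```python
import re


def exact_match(predicted, reference):
    """Strict string equality."""
    return predicted == reference


def relaxed_match(predicted, reference):
    """Equality after removing all whitespace from both names."""
    def squash(name):
        return re.sub(r"\s+", "", name)
    return squash(predicted) == squash(reference)


def match_rates(pairs):
    """Return (exact_rate, relaxed_rate) over (predicted, reference) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    exact = sum(exact_match(p, r) for p, r in pairs)
    relaxed = sum(relaxed_match(p, r) for p, r in pairs)
    return exact / len(pairs), relaxed / len(pairs)
```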

---

## 7. Contact

For questions, issues, or requests:
- **Contact:** Banula Perera
- **Affiliation:** University of Westminster, UK
- **Email:** banulalakidu59@gmail.com

---

## 8. Changelog

- **v1.0.0**: Initial public release.
- **v2.0.0**: Increased the dataset size.
